In [1]:
import pandas as pd
import pandas_profiling
import numpy as np
import seaborn as sns
import sklearn
import missingno as msno 
import matplotlib.pyplot as plt
%matplotlib inline
import plotly.express as px
from mpl_toolkits.mplot3d import Axes3D
from sklearn import preprocessing
from sklearn import model_selection
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.preprocessing import StandardScaler
from sklearn.dummy import DummyClassifier
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from yellowbrick.classifier import ConfusionMatrix
from yellowbrick.classifier import ROCAUC
import sklearn.metrics
from sklearn.metrics import confusion_matrix, roc_auc_score
pd.set_option('display.max_columns', 500)
from imblearn.over_sampling import SMOTE 

Cervical Cancer Risk Factors for Biopsy: this dataset was obtained from the UCI Machine Learning Repository, which is gratefully acknowledged.

This file contains a list of risk factors for cervical cancer leading to a biopsy examination.

About 11,000 new cases of invasive cervical cancer are diagnosed each year in the U.S. However, the number of new cervical cancer cases has been declining steadily over the past decades. Although it is the most preventable type of cancer, each year cervical cancer kills about 4,000 women in the U.S. and about 300,000 women worldwide. In the United States, cervical cancer mortality rates plunged by 74% from 1955 to 1992 thanks to increased screening and early detection with the Pap test.

AGE

Fifty percent of cervical cancer diagnoses occur in women ages 35 - 54, and about 20% occur in women over 65 years of age. The median age of diagnosis is 48 years. About 15% of women develop cervical cancer between the ages of 20 - 30. Cervical cancer is extremely rare in women younger than age 20. However, many young women become infected with multiple types of human papilloma virus, which can then increase their risk of getting cervical cancer in the future. Young women with early abnormal changes who do not have regular examinations are at high risk for localized cancer by the time they are age 40, and for invasive cancer by age 50.

SOCIOECONOMIC AND ETHNIC FACTORS

Although the rate of cervical cancer has declined among both Caucasian and African-American women over the past decades, it remains much more prevalent in African-Americans, whose death rates are twice as high as those of Caucasian women. Hispanic American women have more than twice the risk of invasive cervical cancer as Caucasian women, also due to a lower rate of screening. These differences, however, are almost certainly due to social and economic differences. Numerous studies report that high poverty levels are linked with low screening rates. In addition, lack of health insurance, limited transportation, and language difficulties hinder a poor woman’s access to screening services.

HIGH SEXUAL ACTIVITY

Human papilloma virus (HPV) is the main risk factor for cervical cancer. In adults, the most important risk factor for HPV is sexual activity with an infected person. Women most at risk for cervical cancer are those with a history of multiple sexual partners, sexual intercourse at age 17 years or younger, or both. A woman who has never been sexually active has a very low risk for developing cervical cancer. Sexual activity with multiple partners increases the likelihood of many other sexually transmitted infections (chlamydia, gonorrhea, syphilis). Studies have found an association between chlamydia and cervical cancer risk, including the possibility that chlamydia may prolong HPV infection.

FAMILY HISTORY

Women have a higher risk of cervical cancer if they have a first-degree relative (mother, sister) who has had cervical cancer.

USE OF ORAL CONTRACEPTIVES

Studies have reported a strong association between cervical cancer and long-term use of oral contraception (OC). Women who take birth control pills for more than 5 - 10 years appear to have a much higher risk of HPV infection (up to four times higher) than those who do not use OCs. (Women taking OCs for fewer than 5 years do not have a significantly higher risk.) The reasons for this risk from OC use are not entirely clear. Women who use OCs may be less likely to use a diaphragm, condoms, or other methods that offer some protection against sexually transmitted diseases, including HPV. Some research also suggests that the hormones in OCs might help the virus enter the genetic material of cervical cells.

HAVING MANY CHILDREN

Studies indicate that having many children increases the risk for developing cervical cancer, particularly in women infected with HPV.

SMOKING

Smoking is associated with a higher risk for precancerous changes (dysplasia) in the cervix and for progression to invasive cervical cancer, especially for women infected with HPV.

IMMUNOSUPPRESSION

Women with weak immune systems (such as those with HIV/AIDS) are more susceptible to acquiring HPV. Immunocompromised patients are also at higher risk for having cervical precancer develop rapidly into invasive cancer.

DIETHYLSTILBESTROL (DES)

From 1938 to 1971, diethylstilbestrol (DES), an estrogen-related drug, was widely prescribed to pregnant women to help prevent miscarriages. The daughters of these women face a higher risk for cervical cancer. DES is no longer prescribed.

Loading the dataset

In [2]:
df = pd.read_csv(r'C:\Users\gebra\Desktop\DS\Dataset\Cancer\kag_risk_factors_cervical_cancer.csv')
In [3]:
df.head()
Out[3]:
Age Number of sexual partners First sexual intercourse Num of pregnancies Smokes Smokes (years) Smokes (packs/year) Hormonal Contraceptives Hormonal Contraceptives (years) IUD IUD (years) STDs STDs (number) STDs:condylomatosis STDs:cervical condylomatosis STDs:vaginal condylomatosis STDs:vulvo-perineal condylomatosis STDs:syphilis STDs:pelvic inflammatory disease STDs:genital herpes STDs:molluscum contagiosum STDs:AIDS STDs:HIV STDs:Hepatitis B STDs:HPV STDs: Number of diagnosis STDs: Time since first diagnosis STDs: Time since last diagnosis Dx:Cancer Dx:CIN Dx:HPV Dx Hinselmann Schiller Citology Biopsy
0 18 4.0 15.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 ? ? 0 0 0 0 0 0 0 0
1 15 1.0 14.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 ? ? 0 0 0 0 0 0 0 0
2 34 1.0 ? 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 ? ? 0 0 0 0 0 0 0 0
3 52 5.0 16.0 4.0 1.0 37.0 37.0 1.0 3.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 ? ? 1 0 1 0 0 0 0 0
4 46 3.0 21.0 4.0 0.0 0.0 0.0 1.0 15.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 ? ? 0 0 0 0 0 0 0 0

Replace '?' with NaN

In [4]:
df = df.replace('?', np.nan)
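
As an alternative sketch, the '?' placeholders can be converted to NaN while the file is being read, which also lets pandas infer numeric dtypes right away. The inline CSV below is made-up illustration data, not rows from the dataset, and the name `demo` is only for this example.

```python
import io

import pandas as pd

# Tiny made-up CSV that uses '?' for missing values, like the dataset does
csv_text = "Age,Smokes\n18,0.0\n34,?\n52,1.0\n"

# na_values tells read_csv which strings to treat as missing
demo = pd.read_csv(io.StringIO(csv_text), na_values="?")

print(demo.dtypes)
print(demo.isna().sum())
```

With this option the columns arrive as numbers instead of strings, so later comparisons against numeric values behave as expected.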

Inspect duplicated rows

In [5]:
df.duplicated().value_counts()
Out[5]:
False    835
True      23
dtype: int64
In [6]:
df[df.duplicated()].head()
Out[6]:
Age Number of sexual partners First sexual intercourse Num of pregnancies Smokes Smokes (years) Smokes (packs/year) Hormonal Contraceptives Hormonal Contraceptives (years) IUD IUD (years) STDs STDs (number) STDs:condylomatosis STDs:cervical condylomatosis STDs:vaginal condylomatosis STDs:vulvo-perineal condylomatosis STDs:syphilis STDs:pelvic inflammatory disease STDs:genital herpes STDs:molluscum contagiosum STDs:AIDS STDs:HIV STDs:Hepatitis B STDs:HPV STDs: Number of diagnosis STDs: Time since first diagnosis STDs: Time since last diagnosis Dx:Cancer Dx:CIN Dx:HPV Dx Hinselmann Schiller Citology Biopsy
66 34 3.0 19.0 3.0 0.0 0.0 0.0 1.0 5.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 NaN NaN 0 0 0 0 0 0 0 0
234 25 NaN 18.0 2.0 0.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0 NaN NaN 0 0 0 0 0 0 0 0
255 25 2.0 18.0 2.0 0.0 0.0 0.0 1.0 0.25 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 NaN NaN 0 0 0 0 0 0 0 0
356 18 1.0 17.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 NaN NaN 0 0 0 0 0 0 0 0
395 18 1.0 18.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 NaN NaN 0 0 0 0 0 0 0 0
In [7]:
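# Look for other rows resembling duplicated row 66 (Age 34, 3 partners).
# Note: 'Number of sexual partners' still holds strings at this point (the
# conversion to float happens later), so comparing with 3.0 matches nothing
# and the result below is empty.
df[(df['Number of sexual partners'] == 3.0) & (df['Age'] == 34.0) ].head()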
Out[7]:
Age Number of sexual partners First sexual intercourse Num of pregnancies Smokes Smokes (years) Smokes (packs/year) Hormonal Contraceptives Hormonal Contraceptives (years) IUD IUD (years) STDs STDs (number) STDs:condylomatosis STDs:cervical condylomatosis STDs:vaginal condylomatosis STDs:vulvo-perineal condylomatosis STDs:syphilis STDs:pelvic inflammatory disease STDs:genital herpes STDs:molluscum contagiosum STDs:AIDS STDs:HIV STDs:Hepatitis B STDs:HPV STDs: Number of diagnosis STDs: Time since first diagnosis STDs: Time since last diagnosis Dx:Cancer Dx:CIN Dx:HPV Dx Hinselmann Schiller Citology Biopsy
In [8]:
df.drop_duplicates(inplace = True)
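
To see every copy of a duplicated row side by side before dropping, `duplicated(keep=False)` flags all members of each duplicate group rather than only the later copies. A minimal sketch on made-up data (the frame `demo` and its values are hypothetical):

```python
import pandas as pd

demo = pd.DataFrame({"Age": [34, 25, 34, 18], "Partners": [3.0, 2.0, 3.0, 1.0]})

# keep=False flags every row that has at least one identical twin
all_copies = demo[demo.duplicated(keep=False)]
print(all_copies)

# drop_duplicates keeps the first occurrence of each group by default
deduped = demo.drop_duplicates()
print(len(demo), "->", len(deduped))
```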

Visualize the number of missing values in each column

In [9]:
msno.bar(df)
Out[9]:
<AxesSubplot:>

Convert the dtype to float

In [10]:
df = df.astype('float32')
In [11]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 835 entries, 0 to 857
Data columns (total 36 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   Age                                 835 non-null    float32
 1   Number of sexual partners           810 non-null    float32
 2   First sexual intercourse            828 non-null    float32
 3   Num of pregnancies                  779 non-null    float32
 4   Smokes                              822 non-null    float32
 5   Smokes (years)                      822 non-null    float32
 6   Smokes (packs/year)                 822 non-null    float32
 7   Hormonal Contraceptives             732 non-null    float32
 8   Hormonal Contraceptives (years)     732 non-null    float32
 9   IUD                                 723 non-null    float32
 10  IUD (years)                         723 non-null    float32
 11  STDs                                735 non-null    float32
 12  STDs (number)                       735 non-null    float32
 13  STDs:condylomatosis                 735 non-null    float32
 14  STDs:cervical condylomatosis        735 non-null    float32
 15  STDs:vaginal condylomatosis         735 non-null    float32
 16  STDs:vulvo-perineal condylomatosis  735 non-null    float32
 17  STDs:syphilis                       735 non-null    float32
 18  STDs:pelvic inflammatory disease    735 non-null    float32
 19  STDs:genital herpes                 735 non-null    float32
 20  STDs:molluscum contagiosum          735 non-null    float32
 21  STDs:AIDS                           735 non-null    float32
 22  STDs:HIV                            735 non-null    float32
 23  STDs:Hepatitis B                    735 non-null    float32
 24  STDs:HPV                            735 non-null    float32
 25  STDs: Number of diagnosis           835 non-null    float32
 26  STDs: Time since first diagnosis    71 non-null     float32
 27  STDs: Time since last diagnosis     71 non-null     float32
 28  Dx:Cancer                           835 non-null    float32
 29  Dx:CIN                              835 non-null    float32
 30  Dx:HPV                              835 non-null    float32
 31  Dx                                  835 non-null    float32
 32  Hinselmann                          835 non-null    float32
 33  Schiller                            835 non-null    float32
 34  Citology                            835 non-null    float32
 35  Biopsy                              835 non-null    float32
dtypes: float32(36)
memory usage: 123.9 KB
In [12]:
df.head()
Out[12]:
Age Number of sexual partners First sexual intercourse Num of pregnancies Smokes Smokes (years) Smokes (packs/year) Hormonal Contraceptives Hormonal Contraceptives (years) IUD IUD (years) STDs STDs (number) STDs:condylomatosis STDs:cervical condylomatosis STDs:vaginal condylomatosis STDs:vulvo-perineal condylomatosis STDs:syphilis STDs:pelvic inflammatory disease STDs:genital herpes STDs:molluscum contagiosum STDs:AIDS STDs:HIV STDs:Hepatitis B STDs:HPV STDs: Number of diagnosis STDs: Time since first diagnosis STDs: Time since last diagnosis Dx:Cancer Dx:CIN Dx:HPV Dx Hinselmann Schiller Citology Biopsy
0 18.0 4.0 15.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN NaN 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 15.0 1.0 14.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN NaN 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 34.0 1.0 NaN 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN NaN 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 52.0 5.0 16.0 4.0 1.0 37.0 37.0 1.0 3.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN NaN 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
4 46.0 3.0 21.0 4.0 0.0 0.0 0.0 1.0 15.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN NaN 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

Inspect the Biopsy column

In [13]:
df['Biopsy'].value_counts()
Out[13]:
0.0    781
1.0     54
Name: Biopsy, dtype: int64
In [14]:
UTI = sorted(df['Biopsy'].unique())
uti = []
In [15]:
for k in UTI:
    soma = df[(df['Biopsy'] == k)]
    uti.append(len(soma))
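
The counting loop above can also be written as a single call, since `value_counts()` followed by `sort_index()` returns the per-class counts in class order. A small sketch, using a made-up series in place of `df['Biopsy']`:

```python
import pandas as pd

# Made-up target column standing in for df['Biopsy']
biopsy = pd.Series([0.0, 0.0, 1.0, 0.0, 1.0, 0.0], name="Biopsy")

# Counts per class, ordered by class value (0.0 first, then 1.0)
counts = biopsy.value_counts().sort_index()
uti = counts.tolist()
print(uti)
```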
In [16]:
x = np.arange(len(uti))
labels = ['N-Biopsy', 'Biopsy']

plt.figure(figsize = (15,10))

# One bar per class: green for no biopsy, red for biopsy
plt.bar(x, uti, color=['g', 'r'])

plt.xticks(x, labels)

plt.show()
In [17]:
fig = px.bar(x = x, y = uti)
fig.update_layout(
    title="Number of biopsies",
    xaxis_title="Biopsy result",
    yaxis_title="Total count"
    )

fig.show()
In [18]:
import plotly.graph_objects as go

colors = ['green', 'red']

fig = go.Figure([go.Bar(x=x,y=uti, marker_color=colors,  width=[0.5, 0.5])])
fig.update_layout(
    title="Number of biopsies",
    xaxis_title="Biopsy result",
    yaxis_title="Total count"
    )

fig.show()

The target column is imbalanced: 781 negative versus 54 positive biopsies (about 6.5% positive).

Handling the two columns with the fewest filled values

In [ ]:
x = df["STDs: Time since last diagnosis"].notna() & (df['Biopsy'] == 1)
x.value_counts()
In [ ]:
x = df["STDs: Time since first diagnosis"].notna() & (df['Biopsy'] == 1)
x.value_counts()
In [ ]:
colunas = ['STDs: Time since first diagnosis', 'STDs: Time since last diagnosis']
In [ ]:
df = df.drop(colunas, axis = 1)

In this case both columns were dropped, because they have very few filled values (71 of 835 rows) and do not appear to be related to the target.
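
One pitfall in checks of this kind: in Python `&` binds more tightly than `==`, so each comparison needs its own parentheses before the masks are combined. A small sketch on made-up data (the column name `time_since_dx` is hypothetical), which also reports the share of rows that have a value at all:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "time_since_dx": [np.nan, 2.0, np.nan, 5.0, np.nan],  # hypothetical column
    "Biopsy": [0.0, 1.0, 1.0, 0.0, 0.0],
})

# Parenthesize the comparison so it is evaluated before '&'
filled_and_positive = demo["time_since_dx"].notna() & (demo["Biopsy"] == 1)
print(filled_and_positive.sum())

# Share of rows with a value at all
filled_share = demo["time_since_dx"].notna().mean()
print(filled_share)
```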

In [ ]:
msno.bar(df)
In [ ]:
x = df["STDs:HPV"].notna() & (df['Biopsy'] == 1)
x.value_counts()

Filling missing values with the median

In [20]:
mediana = df.median()
df = df.fillna(mediana)
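
To make the effect of this step concrete, here is a minimal sketch on made-up data. `df.median()` returns one median per column, and `fillna` with that Series fills each column with its own median. For 0/1 indicator columns the median is usually the majority value, so missing flags are typically filled with the more common class.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "years": [1.0, np.nan, 3.0, 5.0],   # continuous column, median 3.0
    "flag": [0.0, 0.0, np.nan, 1.0],    # 0/1 indicator, median 0.0
})

# Per-column medians (NaN values are skipped by default)
medians = demo.median()

# Each column is filled with its own median
filled = demo.fillna(medians)
print(filled)
```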
In [21]:
msno.bar(df)
Out[21]:
<AxesSubplot:>

Split into X and y so the data can be divided into training and test sets

In [22]:
X = df.drop(columns="Biopsy")
In [23]:
y = df['Biopsy']

Handling the class imbalance

In [24]:
sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X, y)

Train/test split

In [25]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(X_res, y_res, test_size=0.3, random_state=42)
In [26]:
modelos = [DummyClassifier,
    LogisticRegression,
    DecisionTreeClassifier,
    KNeighborsClassifier,
    GaussianNB,
    RandomForestClassifier,
    SVC,
    ]
In [27]:
modelos2 = ["DummyClassifier",
    "LogisticRegression",
    "DecisionTreeClassifier",
    "KNeighborsClassifier",
    "GaussianNB",
    "RandomForestClassifier",
    "SVC"]
In [28]:
# Fit each model with default settings and print its test accuracy
for nome, modelo in zip(modelos2, modelos):
    clf = modelo()
    clf.fit(X_train, y_train)
    resultado = clf.score(X_test, y_test)
    print(nome, ":", resultado)
C:\Users\gebra\Anaconda3\envs\ds\lib\site-packages\sklearn\dummy.py:132: FutureWarning:

The default value of strategy will change from stratified to prior in 0.24.

C:\Users\gebra\Anaconda3\envs\ds\lib\site-packages\sklearn\linear_model\_logistic.py:764: ConvergenceWarning:

lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

DummyClassifier : 0.4904051172707889
LogisticRegression : 0.9445628997867804
DecisionTreeClassifier : 0.9658848614072495
KNeighborsClassifier : 0.9104477611940298
GaussianNB : 0.5266524520255863
RandomForestClassifier : 0.9786780383795309
SVC : 0.814498933901919
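
The ConvergenceWarning above suggests scaling the data or raising `max_iter`. One possible way to do both, sketched here on synthetic data rather than the real dataset, is to put a `StandardScaler` in front of `LogisticRegression` inside a pipeline. The value `max_iter=1000` is an arbitrary choice for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the resampled X and y
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# The scaler is fit inside the pipeline, so it only sees training data
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_tr, y_tr)

score = clf.score(X_te, y_te)
print(score)
```

Scaling may also help the distance-based models in the list (KNeighborsClassifier and SVC), since features such as age and years of use are on different ranges from the 0/1 indicators.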
In [29]:
# Refit each model and print its confusion matrix on the test set
for nome, modelo in zip(modelos2, modelos):
    clf = modelo()
    clf.fit(X_train, y_train)
    prev = clf.predict(X_test)
    print(nome, ":", "\n", confusion_matrix(y_test, prev), "\n")
DummyClassifier : 
 [[126 103]
 [126 114]] 

LogisticRegression : 
 [[216  13]
 [ 13 227]] 

DecisionTreeClassifier : 
 [[224   5]
 [ 13 227]] 

KNeighborsClassifier : 
 [[194  35]
 [  7 233]] 

GaussianNB : 
 [[  7 222]
 [  0 240]] 

RandomForestClassifier : 
 [[223   6]
 [  5 235]] 

SVC : 
 [[215  14]
 [ 73 167]] 

RandomForestClassifier and DecisionTreeClassifier reach good accuracy on the resampled test set (about 0.98 and 0.97 in the run above), and their confusion matrices show only a handful of misclassifications in each class.
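
Accuracy alone can hide how the positive class fares, so as a rough check the RandomForestClassifier confusion matrix printed above can be turned into recall and precision by hand. The sketch relies on scikit-learn's layout, where rows are true classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix printed above for RandomForestClassifier
# (rows: true class, columns: predicted class)
cm = np.array([[223, 6],
               [5, 235]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
recall = tp / (tp + fn)       # share of actual positives that were found
precision = tp / (tp + fp)    # share of predicted positives that were right

print(f"accuracy={accuracy:.3f} recall={recall:.3f} precision={precision:.3f}")
```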